Yet Another Python Encoding Tutorial

Guillermo Moncecchi

This notebook shows how to manage text encodings with Python. It is mostly based on the Unicode HOWTO and the codecs library module from the Python documentation. I decided to start with these notes because handling encodings seems a pretty difficult task if you do not understand what Unicode is, how it works, and how there are different ways of encoding the same string.

Characters

Characters are the smallest units of a text. Usually we think of letters, but everything that can have its own glyph (i.e., drawing), such as a number, a math symbol or a Greek letter, is a character. A character table is a list of numerical values for a certain list of characters. The oldest character table in common use is the well-known ASCII table. ASCII defined numeric codes for 128 characters (from 0 to 127), mostly English letters and numbers, but not (for example) accented characters. ASCII characters have an important feature: they use only 7 bits, so they can always be represented by a single byte. Another common encoding, Latin-1 (or ISO-8859-1), uses all 8 bits, encoding 256 characters in one byte each. The first 128 characters of Latin-1 are the same as those of ASCII.
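These overlaps are easy to check. Here is a minimal sketch, written in Python 3 syntax (where bytes.decode plays the role that str.decode plays in the Python 2 cells below):

```python
# The first 128 codes are identical in ASCII and Latin-1:
ascii_bytes = bytes(range(128))
same = ascii_bytes.decode('ascii') == ascii_bytes.decode('latin-1')
print(same)  # True

# Latin-1 uses the full 8 bits: byte 0xE9 is the accented letter 'é'
e_acute = b'\xe9'.decode('latin-1')
print(e_acute)  # é
```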

Python 2 strings are ASCII strings


In [12]:
import sys
print sys.getdefaultencoding()


ascii

If you write the following code in a Python script (let's call it encoding_test.py)

s='€';print s

... you will get an error like this:

SyntaxError: Non-ASCII character '\xe2' in file encoding_test.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

... because you are trying to put a non-ASCII character in a Python 2 string (in the next section we will see how to fix this).

Encoding

Suppose you have a text, and want to display it in your system terminal (i.e., draw its glyphs), or save it to an I/O device. You will need to represent each character in your text using bits grouped into bytes (because that is all computers can understand). This is (almost) trivial for ASCII or Latin-1: just represent the character with its numerical code, using the byte's 8 bits. But what should we do with characters beyond Latin-1? You have no option: assign them a new code, and use more than one byte to represent them. The different ways of representing characters as bytes are called encodings (I will not delve here into how each one encodes every character; what I want to show is how Python manages encodings).
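To make this concrete, here is a small sketch (Python 3 syntax) showing one character turned into different byte sequences by different encodings:

```python
# One character, three different byte representations:
ch = 'é'
print(ch.encode('latin-1'))    # b'\xe9'      -> one byte
print(ch.encode('utf-8'))      # b'\xc3\xa9'  -> two bytes
print(ch.encode('utf-16-le'))  # b'\xe9\x00'  -> two bytes, different ones
```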

The first thing we must understand is that every text document (including Python source files!) is encoded using a certain encoding. You cannot know the encoding for sure by inspecting the file contents, simply because the encoding is a way of transforming characters into bytes: the same text will produce different files under different encodings. There are some tools (on Linux, for example, the enca command) that try to guess it (using "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings"), but, generally speaking, you had better know which encoding the file you want to read uses.
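The converse also holds: the same bytes, decoded with different encodings, produce different text. A quick sketch in Python 3 syntax:

```python
# One byte, two readings, two different characters:
raw = b'\x80'
print(raw.decode('cp1252'))   # '€' (Windows-1252 places the euro sign at 0x80)
print(repr(raw.decode('latin-1')))  # '\x80' (an invisible control character)
```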

For example, if this notebook correctly shows the € sign when we print a variable value, your default system locale is using UTF-8 (a very popular encoding). Let's verify it:


In [14]:
import locale
print locale.getpreferredencoding()


UTF-8

I have created four different text files, saving each one (using my text editor) with a different encoding. Let's see how this notebook displays them. First, a simple ASCII file:


In [4]:
!cat ../data/ASCII_file.txt


This is a plain ascii file. It does not matter which encoding it uses.

ASCII files will always look OK (no matter which encoding your terminal or editor is using), because (as far as I know) every common encoding uses the same codes for the first 128 characters (the ASCII codes). Now, let's try with Latin-1:


In [5]:
!cat ../data/latin_1_file.txt


This is a latin-1 encoded file. 
It includes some accented characters: educaci�n alegr�a c�mara. 
La tercera oraci�n est� en espa�ol.

Now we start to have problems. Since our terminal uses UTF-8, and the file encodes accented characters with Latin-1, the terminal cannot display them correctly, because Latin-1 and UTF-8 represent characters between 128 and 255 differently. If we encode the file using the Mac OS Roman encoding (the encoding classic Mac OS used), we have similar problems:


In [6]:
!cat ../data/mac_os_roman_file.txt


This is an Mac OS Roman. Some accented characters: c�mara alegr�a ilusi�n

However, if we display a UTF-8-encoded file, we have no problems:


In [7]:
!cat ../data/utf-8_file.txt


This is an UTF-8. 
It includes a pair of japanese characters: 東京 (Tokyo). 
The euro sign is this: €. 
Some accented characters: cámara alegría ilusión.

Remember that this happens because the terminal encoding matches the file encoding; there is no 'correct' encoding! UTF-8 is becoming more and more popular (and it is the default source encoding in Python 3), but remember that it is just another encoding.

How could we include (for example) Chinese characters in Python source code? PEP 263 is the answer: you should specify an encoding in the first (or second) line of your source file:


In [8]:
# coding:utf-8
tokyo='東京'
print tokyo


東京

Python is still storing byte strings, but it converts what you wrote to bytes using the encoding you specified. Remember, declaring the encoding does not mean that your source file is actually saved in that encoding: you must also save the file in the declared encoding, using (for example) your text editor.

Unicode

Remember this: Unicode is not an encoding. Unicode is a character table, just like ASCII. The original Unicode design used 16 bits for each character code, giving 2^16 = 65,536 distinct values; modern versions of the standard go well beyond that, up to 1,114,112 code points. Python includes a library, unicodedata, to display information about Unicode characters and their numeric codes (called code points, and usually written in base 16).


In [15]:
import unicodedata
u=unichr(8364)
print u,ord(u),unicodedata.category(u),unicodedata.name(u)


€ 8364 Sc EURO SIGN
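(In Python 3, unichr() is gone; the built-in chr() covers the full Unicode range. The equivalent of the cell above would be:)

```python
import unicodedata

u = chr(8364)  # in Python 3, chr() replaces Python 2's unichr()
print(u, ord(u), unicodedata.category(u), unicodedata.name(u))
# € 8364 Sc EURO SIGN
```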

One important (and confusing...) characteristic of Unicode is that the first 128 character codes are the same as those of ASCII. To tell Python that you are specifying a Unicode string, precede the string with a u, and specify any non-ASCII character by its Unicode code point, using \u and 4 hex digits (or \U and 8 hex digits):


In [10]:
u=u'This string includes an \u20AC sign'
print u


This string includes an € sign

People often confuse Unicode and UTF-8 (just take a look at Stack Overflow), but they are different things: UTF-8 (as we said) is an encoding, while Unicode is a list of codes representing characters. When you build a Unicode string (exactly as you did with regular strings), Python allows you to write it using non-ASCII characters, provided you declared an encoding:


In [11]:
# coding:utf-8
u=u'This string includes an € sign'
print u


This string includes an € sign

... but this does NOT change its internal representation. What happens is that UTF-8 can encode every Unicode character, something that Latin-1 obviously can't (why?).
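A quick answer to the "why?", sketched in Python 3 syntax: Latin-1 has room for only 256 characters, so any code point beyond that simply has no Latin-1 representation, while UTF-8 can spread the code point over several bytes:

```python
ch = '\U0001F600'               # a code point far beyond Latin-1's 256 slots
encoded = ch.encode('utf-8')    # UTF-8 handles it, using four bytes
print(len(encoded))             # 4

try:
    ch.encode('latin-1')        # Latin-1 has no code for it...
except UnicodeEncodeError as e:
    print('latin-1 cannot encode it:', e)
```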

The recommendation for handling character sequences in Python is to always use Unicode strings for processing, so as to fix a standard code (that is, after all, what Unicode was invented for). The unicode constructor builds a unicode value from a string. If you specify no encoding, the constructor will assume you are passing an ASCII-encoded string (and it will fail miserably when it tries to convert a character whose code is beyond 127):


In [12]:
u=unicode('This strings is ascii')
print u, type(u)


This strings is ascii <type 'unicode'>

In [13]:
u=unicode('This string contains \xE2\x82\xAC', encoding='utf-8')
print u,type(u)


This string contains € <type 'unicode'>

In [14]:
u=unicode('This string contains \xE2\x82\xAC')
print u,type(u)


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-14-22f5b6a461ee> in <module>()
----> 1 u=unicode('This string contains \xE2\x82\xAC')
      2 print u,type(u)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)

In [15]:
u=unicode('This string contains €')
print u,type(u)


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-b301343705cc> in <module>()
----> 1 u=unicode('This string contains €')
      2 print u,type(u)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)

This is a very common error, and you should really be sure that you understand what is happening. In the first example, we simply pass an 8-bit string, and Python creates a unicode value, assuming ASCII. In the second, we specify the euro sign using hex values, and tell Python that the 8-bit string we passed is actually UTF-8-encoded. In the third case, Python tries to build the Unicode string assuming ASCII, but when it comes to the byte '\xe2' (that is, a decimal value of 226), it discovers that this is not a valid ASCII value, and fails. This is completely independent of the encoding you are using for your source file.

Most string functions still work if you pass Unicode sequences instead of strings. The advantage is that, once you have converted to Unicode, you can forget about bytes and encodings. If you ask for the length of a Unicode sequence, you are asking for the number of code points, which is encoding-independent (something that does not happen with byte strings, since characters beyond ASCII may need more than one byte each). So, the general advice is: convert your strings to Unicode before working with them.
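The length difference is easy to see; here is a sketch in Python 3 syntax (where str is already Unicode):

```python
s = '東京'                           # two Unicode characters
print(len(s))                        # 2: code points, encoding-independent
print(len(s.encode('utf-8')))        # 6: each kanji needs three UTF-8 bytes
print(len(s.encode('utf-16-le')))    # 4: two bytes each in UTF-16
```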

UTF-8

From the wikipedia:


UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an "octet" in the Unicode Standard). Code points with lower numerical values (i.e. earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

Some features of the UTF-8 encoding: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position. The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence, so that the length of the sequence can be determined without examining the continuation bytes. The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the lead byte, lower-order bits in succeeding continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point. Single bytes, leading bytes, and continuation bytes do not share values. This makes the scheme self-synchronizing, allowing the start of a character to be found by backing up at most five bytes (three bytes in actual UTF‑8 as explained below).
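The leading-byte/continuation-byte scheme described above can be inspected directly. A small sketch (Python 3 syntax) printing the bit patterns of the euro sign's three UTF-8 bytes:

```python
# The euro sign is U+20AC; in UTF-8 it takes a three-byte sequence.
bits = [format(b, '08b') for b in '€'.encode('utf-8')]
for pattern in bits:
    print(pattern)
# 11100010   leading byte: three high-order 1s -> a three-byte sequence
# 10000010   continuation byte (starts with 10)
# 10101100   continuation byte (starts with 10)
```

The remaining x bits (0010 000010 101100) are exactly the binary value of 0x20AC.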

Input/Output

Most problems with encodings arise when you do input/output. The purpose of this section is to explain what happens when you read/write strings from a file, and how to avoid strange encoding errors. Let me cite the Unicode HOWTO:

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It’s possible to do all the work yourself: open a file, read an 8-bit string from it, and convert the string with unicode(str, encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1K or 4K), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk.
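The chunk-boundary problem can be seen with an incremental decoder from the codecs module, which buffers a partial character between calls. A sketch in Python 3 syntax:

```python
import codecs

data = 'cámara'.encode('utf-8')      # 'á' occupies two bytes here
chunk1, chunk2 = data[:2], data[2:]  # the split lands in the middle of 'á'

# A plain decode of the first chunk fails: it ends mid-character.
try:
    chunk1.decode('utf-8')
except UnicodeDecodeError as e:
    print('naive decode fails:', e)

# An incremental decoder keeps the partial character between calls:
decoder = codecs.getincrementaldecoder('utf-8')()
text = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
print(text)  # cámara
```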

To read bytes from a file and convert them to Unicode characters, the codecs module implements an open() function that returns a file-like object that assumes the file’s contents are in a specified encoding and accepts Unicode parameters for methods such as .read() and .write()

Let's start by reading an ascii file, using the traditional file.open() function


In [16]:
ascii_file =open('../data/ASCII_file.txt')
for line in ascii_file:
        print line


This is a plain ascii file. It does not matter which encoding it uses.

Now, try to open the utf-8 file


In [17]:
utf_8_file =open('../data/utf-8_file.txt')
for line in utf_8_file:
        print line, type(line)


This is an UTF-8. 
<type 'str'>
It includes a pair of japanese characters: 東京 (Tokyo). 
<type 'str'>
The euro sign is this: €. 
<type 'str'>
Some accented characters: cámara alegría ilusión.
<type 'str'>

First observation: you can use open on encoded files; open just treats them as a stream of bytes. If your terminal has the right encoding, the string will be displayed correctly. But beware: you are working with bytes, not with Unicode, which is not what is recommended, for the reasons previously mentioned. Let's see what happens if we try to open and display the Latin-1 file:


In [18]:
latin_1_file =open('../data/latin_1_file.txt')
for line in latin_1_file:
        print line


This is a latin-1 encoded file. 

It includes some accented characters: educaci�n alegr�a c�mara. 

La tercera oraci�n est� en espa�ol.

Not working. Actually, it is working: it just reads bytes and displays them... using UTF-8, not Latin-1. We can decode each line ourselves, specifying the encoding and converting it to Unicode:


In [46]:
latin_1_file =open('../data/latin_1_file.txt')
for line in latin_1_file:
        print line.decode('latin-1')


This is a latin-1 encoded file. 

It includes some accented characters: educación alegrí­a cámara. 

La tercera oración está en español.

Now, let's try to read the files into Unicode, using codecs.open instead of the standard open function:


In [20]:
import codecs
utf_8_file =codecs.open('../data/UTF-8_file.txt', encoding='utf-8')
for line in utf_8_file:
        print type(line),line


<type 'unicode'> This is an UTF-8. 

<type 'unicode'> It includes a pair of japanese characters: 東京 (Tokyo). 

<type 'unicode'> The euro sign is this: €. 

<type 'unicode'> Some accented characters: cámara alegría ilusión.

Note that the type has changed: we are working with Unicode sequences, that is, with an encoding-independent representation. That is what we wanted! Unicode strings also provide an encode() method that (you guessed it, didn't you?) allows you to change encodings. For example, let's first build a Unicode string using the Unicode codes for accented letters (note that, specifying the string this way, we do not have to worry about the source encoding):


In [48]:
s=u'Alegr\u00EDa c\u00E1mara ilusi\u00F3n'

If we want to display it, Python will use our system's encoding (in our case, UTF-8):


In [49]:
print s, type(s)


Alegría cámara ilusión <type 'unicode'>

If, for some reason, we want to encode it using Latin-1, we just call encode('latin-1') (the result will not display well in our UTF-8 terminal):


In [51]:
s1=s.encode('latin-1')
print s1,type(s1)


Alegr�a c�mara ilusi�n <type 'str'>

(Note that encode returns an 8-bit string, while decode returns a Unicode string.) We can try to encode it using Python's default encoding (ASCII)... and fail:
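Before the failure, it is worth seeing the happy path: encode and decode are inverses as long as the same encoding is used on both sides, and using mismatched encodings silently produces mojibake. A sketch in Python 3 syntax:

```python
s = 'cámara'

# Round trip with matching encodings recovers the original text:
round_trip = s.encode('latin-1').decode('latin-1')
print(round_trip == s)  # True

# Decoding with the WRONG encoding does not raise here; it garbles:
mojibake = s.encode('utf-8').decode('latin-1')
print(mojibake)  # cÃ¡mara
```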


In [24]:
print s.encode()


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-24-9435fa5dcf9b> in <module>()
----> 1 print s.encode()

UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)

By now, you should understand why this happened. There is no way to represent 'á' in ASCII, simply because it is not in its code list. The same can happen with any pair of encodings. For example, let's try to encode the kanji for Tokyo using Latin-1:


In [35]:
# coding:utf-8
tokyo=u'This is 東京, boy!'
print tokyo.encode('latin-1')


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-35-9b54e7ed7853> in <module>()
      1 
      2 tokyo=u'This is 東京, boy!'
----> 3 print tokyo.encode('latin-1')

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)

Most encoding problems reduce to this: you cannot encode what the encoding cannot represent. If a symbol is not present in an encoding, you simply cannot encode it with it. The encode() method offers some ways to manage this, through its errors parameter. The default value ('strict') raises the previous error when a character cannot be encoded. But we can specify 'ignore' to simply skip the characters it cannot manage, 'replace' to use a replacement character, 'xmlcharrefreplace' to substitute XML character references, or 'backslashreplace' to substitute backslashed escape sequences:


In [36]:
print tokyo.encode('latin-1','ignore')
print tokyo.encode('latin-1','replace')


This is , boy!
This is ??, boy!

In [37]:
print tokyo.encode('latin-1','backslashreplace')


This is \u6771\u4eac, boy!

A real world example: tagging Spanish texts

In this example, I will use the TreeTagger to tag some text (using the treetagger Python module, based on the TaggerI class from NLTK):


In [28]:
# You must have NLTK installed, treetagger installed and the TREETAGGER_HOME env variable set for this to work
# export TREETAGGER_HOME='/path/to/your/TreeTagger/'
#
from treetagger import TreeTagger

Checking the documentation, we find that the TreeTagger can tag latin-1 and utf-8 texts. We must also specify the language, but this has nothing to do with the encoding; it depends on how the tagger was trained... First, let's tag an English text, to check that everything is working (the English version of the TreeTagger only accepts latin-1-encoded strings).


In [29]:
tt=TreeTagger(language='english',encoding='latin-1')
tagged_sent=tt.tag('What is the airspeed of an unladen swallow? And what about the € sign?')
print tagged_sent
print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


[[u'What', u'WP', u'What'], [u'is', u'VBZ', u'be'], [u'the', u'DT', u'the'], [u'airspeed', u'NN', u'airspeed'], [u'of', u'IN', u'of'], [u'an', u'DT', u'an'], [u'unladen', u'JJ', u'<unknown>'], [u'swallow', u'NN', u'swallow'], [u'?', u'SENT', u'?'], [u'And', u'CC', u'and'], [u'what', u'WP', u'what'], [u'about', u'IN', u'about'], [u'the', u'DT', u'the'], [u'\u20ac', u'JJ', u'<unknown>'], [u'sign', u'NN', u'sign'], [u'?', u'SENT', u'?']]

Readable version:What/WP is/VBZ the/DT airspeed/NN of/IN an/DT unladen/JJ swallow/NN ?/SENT And/CC what/WP about/IN the/DT €/JJ sign/NN ?/SENT

Let's see in detail what happened here. First, tt.tag() received a plain Python string (which, since we are using Python 2, is a byte string). Even though we typed a '€' sign, it was stored (using our environment's encoding, i.e., UTF-8) as 3 bytes, the 3 bytes of its UTF-8 representation:


In [30]:
print len('€')


3

See the difference between '$' (a symbol from the ASCII table) and '€' (by now, you should anticipate the answer):


In [31]:
print len('What is the airspeed of an unladen swallow? And what about the € sign?')
print len('What is the airspeed of an unladen swallow? And what about the $ sign?')


72
70

After that, the byte stream is sent to the TreeTagger, which analyzes it and returns its results. The tagger knows nothing about Unicode: it analyzes words (i.e., byte streams) and returns their tags, depending on its training corpus (which can itself be encoded in different ways). The Python module takes this output and converts it to Unicode strings using UTF-8 (I had to check the module's source code to find that). That is why the euro sign is recovered in the output! But keep in mind that where we see the euro sign, the tagger saw not one Unicode character but three bytes. Now, let's do exactly the same, but explicitly telling the tagger that we wrote a Unicode string:


In [32]:
tagged_sent=tt.tag(u'What is the airspeed of an unladen swallow? And what about the € sign?')
print tagged_sent
print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-32-43a4e2b4a7f1> in <module>()
----> 1 tagged_sent=tt.tag(u'What is the airspeed of an unladen swallow? And what about the € sign?')
      2 print tagged_sent
      3 print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])

/Users/guillermo/Dropbox/fing/work/datascience/src/treetagger.pyc in tag(self, sentences)
    136 
    137         if isinstance(_input, unicode) and encoding:
--> 138             _input = _input.encode(encoding)
    139 
    140         # Run the tagger and get the output

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 63: ordinal not in range(256)

Oops. This time the Python module tries, before sending the input to the tagger, to encode it (it needs to, since the tagger is accessed through a system pipe and so cannot understand Unicode), and it uses the specified encoding (latin-1). But the euro sign is not part of the Latin-1 character table, so the encoding fails. Can we circumvent this? Not with this module: we would need a UTF-8 version of the English tagger, which we do not have. Let's see what happens with the UTF-8 version of the Spanish tagger:


In [33]:
tt=TreeTagger(language='spanish',encoding='utf8')
tagged_sent=tt.tag(u'¿Podremos taggear esto? ¿Y qué pasa con el signo de €? ')
print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


Readable version:¿/FS Podremos/VLfin taggear/VLinf esto/DM ?/FS ¿/FS Y/CC qué/INT pasa/VLfin con/PREP el/ART signo/NC de/PREP €/NC ?/FS

Now everything works: the tagger receives UTF-8-encoded strings, and its results are converted back to Unicode. It does not even matter if we fail to mark the input as Unicode, because our system uses UTF-8, and the results are the same:


In [34]:
tt=TreeTagger(language='spanish',encoding='utf8')
tagged_sent=tt.tag('¿Podremos taggear esto? ¿Y qué pasa con el signo de €? ')
print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


Readable version:¿/FS Podremos/VLfin taggear/VLinf esto/DM ?/FS ¿/FS Y/CC qué/INT pasa/VLfin con/PREP el/ART signo/NC de/PREP €/NC ?/FS

One last test: let's tag a Bulgarian sentence (we can only use utf-8 for this, since Bulgarian characters are not in latin-1):


In [35]:
# There is a bug in the Python treetagger module:
# it assumes that when a language has only one encoding, that encoding is latin-1.
# Bulgarian, for example, only allows utf-8,
# so we have to tell the module that we are using latin-1.
tt=TreeTagger(language='bulgarian',encoding='latin-1')
tagged_sent=tt.tag('Това е моят дом')
print '\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


Readable version:Това/Pde-os-n е/Vxitf-r3s моят/Psol-s1mf дом/Ncmsi

Now, let's read sentences from differently encoded files, tag them, and show their UTF-8-encoded versions in our terminal. Let's first read the Latin-1 file, using the codecs package, and tag its third sentence (written in Spanish):


In [36]:
import codecs
tt=TreeTagger(language='spanish',encoding='utf8')
f = codecs.open('../data/latin_1_file.txt', encoding='latin-1')
sents=f.readlines()
spanish_sent=sents[2]
tagged_sent=tt.tag(spanish_sent)
print type(spanish_sent),spanish_sent, ' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


<type 'unicode'> La tercera oración está en español.
La/ART tercera/ORD oración/NC está/VEfin en/PREP español/NC ./FS

Now, let's process the UTF-8 file the same way (of course, it is nonsense to tag English tokens with a Spanish tagger, but we want to check that the UTF-8 reading works as expected):


In [37]:
f = codecs.open('../data/utf-8_file.txt', encoding='utf-8')
sents=f.readlines()
for sent in sents:
    tagged_sent=tt.tag(sent)
    print '\n',type(sent),sent, ' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])


<type 'unicode'> This is an UTF-8. 
This/NP is/PE an/PE UTF-8/NC ./FS

<type 'unicode'> It includes a pair of japanese characters: 東京 (Tokyo). 
It/NC includes/VLfin a/PREP pair/NC of/PE japanese/NC characters/PE :/COLON 東京/NC (/LP Tokyo/NP )/RP ./FS

<type 'unicode'> The euro sign is this: €. 
The/PE euro/NC sign/PE is/PE this/NC :/COLON €/NC ./FS

<type 'unicode'> Some accented characters: cámara alegría ilusión.
Some/NP accented/VLfin characters/PE :/COLON cámara/NC alegría/NC ilusión/NC ./FS

Python 3

Let us cite part of What's New In Python 3.0. If we did our job well, you should now understand every point:

  • Text Vs. Data Instead Of Unicode Vs. 8-bit

Everything you thought you knew about binary data and Unicode has changed.

  • Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
  • As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using unicode for all unencoded text, and str for binary or encoded data only. Then the 2to3 tool will do most of the work for you.
  • You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
  • All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
  • Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the codecs module.
  • PEP 3120: The default source encoding is now UTF-8.
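The text-vs-binary distinction in the bullets above can be sketched with Python 3's built-in open() (the file name below is just an illustration, written to a temporary directory):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'euro.txt')

# Text mode: we hand open() a str, and it handles the encoding for us.
with open(path, 'w', encoding='utf-8') as f:
    f.write('€')

# Binary mode: we get the raw bytes back.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)   # b'\xe2\x82\xac'

# Text mode again: the bytes on disk are decoded back into a str.
with open(path, encoding='utf-8') as f:
    text = f.read()
print(text)  # €
```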